Overview: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section: K-Nearest Neighbors, Logistic Regression, Decision Trees, and Support Vector Machines. We will use a dataset related to the marketing of bank products over the telephone.
Our dataset comes from the UCI Machine Learning Repository. The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns. For more information on the data and features, we will make use of the article accompanying the dataset.
To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the Materials and Methods section of the paper. How many marketing campaigns does this data represent?
Use pandas to read in the dataset bank-additional-full.csv and assign to a meaningful variable name.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from tqdm import tqdm
from time import time
from pylab import rcParams
from sklearn import metrics, preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')
df.head()
| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
5 rows × 21 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             41188 non-null  int64
 1   job             41188 non-null  object
 2   marital         41188 non-null  object
 3   education       41188 non-null  object
 4   default         41188 non-null  object
 5   housing         41188 non-null  object
 6   loan            41188 non-null  object
 7   contact         41188 non-null  object
 8   month           41188 non-null  object
 9   day_of_week     41188 non-null  object
 10  duration        41188 non-null  int64
 11  campaign        41188 non-null  int64
 12  pdays           41188 non-null  int64
 13  previous        41188 non-null  int64
 14  poutcome        41188 non-null  object
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr.employed     41188 non-null  float64
 20  y               41188 non-null  object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB
df.describe()
| | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.00000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 | 41188.000000 |
| mean | 40.02406 | 258.285010 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.502600 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.570960 | 0.578840 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.00000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | -3.400000 | 92.201000 | -50.800000 | 0.634000 | 4963.600000 |
| 25% | 32.00000 | 102.000000 | 1.000000 | 999.000000 | 0.000000 | -1.800000 | 93.075000 | -42.700000 | 1.344000 | 5099.100000 |
| 50% | 38.00000 | 180.000000 | 2.000000 | 999.000000 | 0.000000 | 1.100000 | 93.749000 | -41.800000 | 4.857000 | 5191.000000 |
| 75% | 47.00000 | 319.000000 | 3.000000 | 999.000000 | 0.000000 | 1.400000 | 93.994000 | -36.400000 | 4.961000 | 5228.100000 |
| max | 98.00000 | 4918.000000 | 56.000000 | 999.000000 | 7.000000 | 1.400000 | 94.767000 | -26.900000 | 5.045000 | 5228.100000 |
df.isnull().sum()
age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64
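Note that isnull() reports zero missing values, yet (as the data description below points out) several categorical columns use the literal string 'unknown' as an in-band missing-value placeholder, which isnull() cannot see. A minimal sketch of counting those placeholders — the tiny frame here is an illustrative stand-in for df:

```python
import pandas as pd

# Illustrative stand-in for df (the real data is read from
# 'data/bank-additional-full.csv' with sep=';' as above).
sample = pd.DataFrame({
    'job': ['admin.', 'unknown', 'services'],
    'housing': ['no', 'unknown', 'unknown'],
    'age': [56, 57, 37],
})

# Count 'unknown' placeholders in the categorical (object) columns;
# numeric columns are excluded by select_dtypes.
unknown_counts = (sample.select_dtypes(include='object') == 'unknown').sum()
print(unknown_counts)
```

Running the same two lines against the full df shows which features carry hidden missingness despite the clean isnull() report.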
Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
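One detail worth handling explicitly: pdays uses 999 as a sentinel for "client was not previously contacted", and the sentinel dominates the column (its median and quartiles above are all 999), so it would distort any standard scaling. A hedged sketch of one common fix, splitting the sentinel out into a binary flag — the `previously_contacted` column name is my own, not from the dataset:

```python
import pandas as pd
import numpy as np

# Illustrative stand-in for the pdays column (999 = never contacted).
df_demo = pd.DataFrame({'pdays': [999, 3, 999, 6]})

# Keep a binary contact flag, then mask the sentinel out of the
# numeric column so it cannot skew the mean/std used by a scaler.
df_demo['previously_contacted'] = (df_demo['pdays'] != 999).astype(int)
df_demo['pdays'] = df_demo['pdays'].replace(999, np.nan)
print(df_demo)
```

Whether to impute the masked values or drop the numeric column entirely is a modeling choice; the flag alone already carries most of the signal for clients never contacted before.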
df.shape
(41188, 21)
#Histograms showing the distribution of each numeric feature
df.hist(figsize=(20,15), bins=25)
array([[<AxesSubplot:title={'center':'age'}>,
<AxesSubplot:title={'center':'duration'}>,
<AxesSubplot:title={'center':'campaign'}>],
[<AxesSubplot:title={'center':'pdays'}>,
<AxesSubplot:title={'center':'previous'}>,
<AxesSubplot:title={'center':'emp.var.rate'}>],
[<AxesSubplot:title={'center':'cons.price.idx'}>,
<AxesSubplot:title={'center':'cons.conf.idx'}>,
<AxesSubplot:title={'center':'euribor3m'}>],
[<AxesSubplot:title={'center':'nr.employed'}>, <AxesSubplot:>,
<AxesSubplot:>]], dtype=object)
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x1e947f37e50>
# Heatmap of the correlation matrix, to show correlations among the numeric features
fig, ax = plt.subplots(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True), annot=True)
<AxesSubplot:>
Observations
# Campaign Acceptance Rate based on Number of Employees
fig, ax = plt.subplots(figsize=(15,5))
sns.countplot(data=df, x='nr.employed', hue='y')
plt.title('Campaign Acceptance Rate based on Number of Employees')
Text(0.5, 1.0, 'Campaign Acceptance Rate based on Number of Employees')
Observations
# Campaign Acceptance Rate based on euribor3m
px.histogram(df, x='euribor3m', color='y', title='Campaign Acceptance Rate based on euribor3m', text_auto=True)
Observations
# Campaign Acceptance Rate based on Consumer Confidence Index
fig, ax = plt.subplots(figsize=(15,5))
sns.countplot(data=df, x='cons.conf.idx', hue='y')
plt.title('Campaign Acceptance Rate based on Consumer Confidence Index')
Text(0.5, 1.0, 'Campaign Acceptance Rate based on Consumer Confidence Index')
Observations
# Campaign Acceptance Rate based on Consumer Price Index
fig, ax = plt.subplots(figsize=(20,5))
sns.countplot(data=df, x='cons.price.idx', hue='y')
plt.title('Campaign Acceptance Rate based on Consumer Price Index')
Text(0.5, 1.0, 'Campaign Acceptance Rate based on Consumer Price Index')
Observations
# Campaign Acceptance Rate based on Employment Variation Rate
fig, ax = plt.subplots(figsize=(10,5))
sns.countplot(data=df, x='emp.var.rate', hue='y')
plt.title('Campaign Acceptance Rate based on Employment Variation Rate')
Text(0.5, 1.0, 'Campaign Acceptance Rate based on Employment Variation Rate')
Observations
# Campaign Acceptance Rate based on previous contact outcome
fig, ax = plt.subplots(figsize=(10,5))
sns.countplot(data=df, x='poutcome', hue='y')
plt.title('Campaign Acceptance Rate based on previous contact outcome')
Text(0.5, 1.0, 'Campaign Acceptance Rate based on previous contact outcome')
Observations
# Campaign Acceptance Rate based on Number of Previous Contacts
fig, ax = plt.subplots(figsize=(10,5))
sns.countplot(data=df, x='previous', hue='y')
plt.title('Campaign Acceptance Rate based on Number of Previous Contacts')
Text(0.5, 1.0, 'Campaign Acceptance Rate based on Number of Previous Contacts')
Observations
# Campaign Acceptance Rate based on pdays
fig, ax = plt.subplots(figsize=(8,5))
sns.countplot(data=df, x='pdays', hue='y')
plt.title('Campaign Acceptance Rate based on PDays')
Text(0.5, 1.0, 'Campaign Acceptance Rate based on PDays')
Observations
# Campaign Acceptance Rate based on Number of Contacts Performed
fig, ax = plt.subplots(figsize=(10,5))
sns.countplot(data=df, x='campaign', hue='y')
plt.title('Campaign Acceptance Rate based on Number of Contacts Performed')
Text(0.5, 1.0, 'Campaign Acceptance Rate based on Number of Contacts Performed')
Observations
# Campaign Acceptance Rate based on Contact Duration
px.histogram(df, x='duration', color='y', title='Campaign Acceptance Rate based on the Contact Duration')
Observations
# Campaign Acceptance Rate based on Day of Week
px.histogram(df, x='day_of_week', color='y', title='Campaign Acceptance Rate based on Last Contact Day of Week', text_auto=True)
Observations
# Campaign Acceptance Rate based on Month
px.histogram(df, x='month', color='y', title='Campaign Acceptance Rate based on Last Contact Month', text_auto=True)
Observations
# Campaign Acceptance Rate based on Contact Type
px.histogram(df, x='contact', color='y', title='Campaign Acceptance Rate based on Contact Type', text_auto=True)
Observations
# Campaign Acceptance Rate based on Personal Loan
px.histogram(df, x='loan', color='y', title='Campaign Acceptance Rate based on Personal Loan', text_auto=True)
Observations
# Campaign Acceptance Rate based on Housing Loan
px.histogram(df, x='housing', color='y', title='Campaign Acceptance Rate based on Housing Loan', text_auto=True)
Observations
# Campaign Acceptance Rate based on Credit Default
px.histogram(df, x='default', color='y', title='Campaign Acceptance Rate based on Credit Default', text_auto=True)
Observations
# Campaign Acceptance Rate based on Education
px.histogram(df, x='education', color='y', title='Campaign Acceptance Rate based on Education', text_auto=True)
Observations
# Campaign Acceptance Rate based on Marital Status
px.histogram(df, x='marital', color='y', title='Campaign Acceptance Rate based on Marital Status', text_auto=True)
Observations
# Campaign Acceptance Rate based on Job
px.histogram(df, x='job', color='y', title='Campaign Acceptance Rate based on Job', text_auto=True)
Observations
# Campaign Acceptance rate based on age
px.histogram(df, x='age', color='y', title='Campaign Acceptance Rate based on Age')
Observations
# Campaign Acceptance rate in the given dataset
#Of the 41188 campaign responses recorded in this dataset, only about 11.3% subscribed to a term deposit.
px.histogram(df, x='y', title='Campaign Acceptance Rate', text_auto=True)
Observations
After examining the description and data, your goal now is to clearly state the Business Objective of the task. State the objective below.
Identification of Business Goals: the business objective is to predict whether a client will subscribe to a term deposit, so that the bank can focus its telephone marketing on the clients most likely to accept and reduce the number of unproductive calls.
Now that you understand your business objective, we will build a basic model to get started. Before we can do this, we must work to encode the data. Using just the bank information features (columns 1 - 7), prepare the features and target column for modeling with appropriate encoding and transformations.
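As a point of comparison with the label encoding used below, the bank-client subset (columns 1 - 7) can also be prepared with one-hot encoding via pd.get_dummies, which avoids imposing an artificial ordering on nominal categories. A minimal sketch — the two demo rows and variable names are illustrative, not the notebook's data:

```python
import pandas as pd

bank_client_cols = ['age', 'job', 'marital', 'education',
                    'default', 'housing', 'loan']

# Two illustrative rows; in the notebook these come from df.
demo = pd.DataFrame({
    'age': [56, 37],
    'job': ['housemaid', 'services'],
    'marital': ['married', 'married'],
    'education': ['basic.4y', 'high.school'],
    'default': ['no', 'no'],
    'housing': ['no', 'yes'],
    'loan': ['no', 'no'],
    'y': ['no', 'yes'],
})

# One-hot encode the categorical bank-client features and map the
# target to 0/1; drop_first removes one redundant dummy per category.
X = pd.get_dummies(demo[bank_client_cols], drop_first=True)
y = demo['y'].map({'no': 0, 'yes': 1})
```

Label encoding keeps one column per feature but implies an order; one-hot trades a wider matrix for order-free categories, which tends to suit Logistic Regression and SVMs better.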
#Identify and define the numerical and categorical columns
numerical_cols = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y']
#Use a LabelEncoder to digitize the columns that are not already int/float
encoder_df = df.copy()
encoder = preprocessing.LabelEncoder()

def target_encoder(data):
    # Fit-transform the column, then write the integer codes back in place
    impute_ordinal = encoder.fit_transform(data)
    data.loc[data.notnull()] = np.squeeze(impute_ordinal)
    return data

for col in tqdm(categorical_cols):
    target_encoder(encoder_df[col])
100%|██████████| 11/11 [00:01<00:00, 6.63it/s]
# As we have learned, the next step is to scale the numeric columns
scaler = preprocessing.StandardScaler()
numeric_df = scaler.fit_transform(encoder_df.drop(
    ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
     'month', 'day_of_week', 'campaign', 'previous', 'poutcome',
     'emp.var.rate', 'y'], axis=1))
scaled_df = pd.DataFrame(numeric_df, columns=['age', 'duration', 'pdays',
    'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'])
# Carry the label-encoded / unscaled columns over unchanged
for col in ['job', 'marital', 'education', 'default', 'housing', 'loan',
            'contact', 'month', 'day_of_week', 'campaign', 'previous',
            'poutcome', 'emp.var.rate', 'y']:
    scaled_df[col] = encoder_df[col].values
scaled_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             41188 non-null  float64
 1   duration        41188 non-null  float64
 2   pdays           41188 non-null  float64
 3   cons.price.idx  41188 non-null  float64
 4   cons.conf.idx   41188 non-null  float64
 5   euribor3m       41188 non-null  float64
 6   nr.employed     41188 non-null  float64
 7   job             41188 non-null  object
 8   marital         41188 non-null  object
 9   education       41188 non-null  object
 10  default         41188 non-null  object
 11  housing         41188 non-null  object
 12  loan            41188 non-null  object
 13  contact         41188 non-null  object
 14  month           41188 non-null  object
 15  day_of_week     41188 non-null  object
 16  campaign        41188 non-null  int64
 17  previous        41188 non-null  int64
 18  poutcome        41188 non-null  object
 19  emp.var.rate    41188 non-null  float64
 20  y               41188 non-null  object
dtypes: float64(8), int64(2), object(11)
memory usage: 6.6+ MB
With your data prepared, split it into a train and test set.
X = scaled_df.drop('y', axis=1)
y = scaled_df['y'].astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
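With only about 11% 'yes' responses, a plain random split can drift from the overall class balance. Passing stratify=y to train_test_split keeps the minority-class share nearly identical in both folds; a minimal sketch on synthetic data (the split above did not use stratification, so this is an optional refinement):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (~10% positives) standing in for y.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.1).astype(int)

# stratify keeps the positive-class rate essentially equal in the
# train and test folds, instead of leaving it to chance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
print(y_tr.mean(), y_te.mean())
```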
Before we build our first model, we want to establish a baseline. What is the baseline performance that our classifier should aim to beat?
models = ['Baseline','Logistic Regression', 'KNN', 'Decision Tree', 'SVM']
train_time = []
train_accuracy = []
test_accuracy = []
accuracy_score = []
AUC_score = []
dummy = DummyClassifier(strategy='uniform', random_state=42)
start_time = time()
dummy.fit(X_train, y_train)
train_time.append(time() - start_time)
train_accuracy.append(dummy.score(X_train, y_train))
test_accuracy.append(dummy.score(X_test, y_test))
accuracy_score.append('N/A')
AUC_score.append('N/A')
print("#################BASELINE ANALYSIS####################\n")
print(f'Training time :{train_time}')
print(f'Training accuracy :{train_accuracy}')
print(f'Test accuracy :{test_accuracy}')
print(f'Accuracy score : Not Yet Available')
print(f'AUC score : Not Yet Available')
print("\n######################################################")
#################BASELINE ANALYSIS####################

Training time :[0.0030012130737304688]
Training accuracy :[0.5025840241406819]
Test accuracy :[0.5020636076717649]
Accuracy score : Not Yet Available
AUC score : Not Yet Available

######################################################
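The 'uniform' dummy above flips a fair coin and lands near 50% accuracy, but with roughly 89% 'no' responses the majority-class baseline is the more demanding bar: always predicting 'no' already scores about 0.89. A minimal sketch on a synthetic target with the same class balance (the demo arrays are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic 89%-negative target mimicking the class balance here.
y_demo = np.array([0] * 89 + [1] * 11)
X_demo = np.zeros((100, 1))

# 'most_frequent' always predicts the majority class, so its accuracy
# equals the majority-class share -- the bar a useful model must clear.
dummy_mf = DummyClassifier(strategy='most_frequent')
dummy_mf.fit(X_demo, y_demo)
print(dummy_mf.score(X_demo, y_demo))   # 0.89
```

This is why accuracy alone is a weak yardstick on this dataset, and why recall/precision/AUC are tracked in the tuning section below.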
Use Logistic Regression to build a basic model on your data.
lgr = LogisticRegression(solver='liblinear', random_state=42)
start_time = time()
lgr.fit(X_train, y_train)
train_time.append(time() - start_time)
y_pred = lgr.predict(X_test)
What is the accuracy of your model?
train_accuracy.append(lgr.score(X_train, y_train))
test_accuracy.append(lgr.score(X_test, y_test))
accuracy_score.append(metrics.accuracy_score(y_test, y_pred))
fpr, tpr, _thresholds = metrics.roc_curve(y_test, y_pred)
AUC_score.append(metrics.auc(fpr, tpr))
print("#################LOGISTIC REGRESSION ANALYSIS####################\n")
print(f'Training time :{train_time[1]}')
print(f'Training accuracy :{train_accuracy[1]}')
print(f'Test accuracy :{test_accuracy[1]}')
print(f'Accuracy score : {accuracy_score[1]}')
print(f'AUC score : {AUC_score[1]}')
print("\n######################################################")
#################LOGISTIC REGRESSION ANALYSIS####################

Training time :0.18693971633911133
Training accuracy :0.9102355103881239
Test accuracy :0.9101723719349356
Accuracy score : 0.9101723719349356
AUC score : 0.6906601848219758

######################################################
cm_plot = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
plt.title("Logistic Regression")
Text(0.5, 1.0, 'Logistic Regression')
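One caveat on the AUC reported above: metrics.roc_curve was fed the hard 0/1 output of predict(), which collapses the ROC curve to a single threshold. Scoring with predicted probabilities sweeps all thresholds and is usually the figure to report. A self-contained sketch on synthetic data (the arrays and names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary problem: feature 0 carries the signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (2 * X_demo[:, 0] + rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# AUC from hard labels evaluates a single operating point...
auc_labels = roc_auc_score(y_demo, clf.predict(X_demo))
# ...while AUC from probabilities ranks examples across all thresholds.
auc_probs = roc_auc_score(y_demo, clf.predict_proba(X_demo)[:, 1])
print(auc_labels, auc_probs)
```

In the notebook, replacing y_pred with lgr.predict_proba(X_test)[:, 1] in the roc_curve call would give the threshold-free AUC.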
Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models. Using the default settings for each of the models, fit and score each. Also, be sure to compare the fit time of each of the models. Present your findings in a DataFrame similar to that below:
| Model | Train Time | Train Accuracy | Test Accuracy |
|---|---|---|---|
| . | . |
model_pipeline = []
model_pipeline.append(KNeighborsClassifier())
model_pipeline.append(DecisionTreeClassifier(random_state=42))
model_pipeline.append(SVC(random_state=42))
cm_results = []
model_pipeline
[KNeighborsClassifier(), DecisionTreeClassifier(random_state=42), SVC(random_state=42)]
for model in model_pipeline:
start_time = time()
model.fit(X_train, y_train)
train_time.append(time() - start_time)
y_pred = model.predict(X_test)
train_accuracy.append(model.score(X_train, y_train))
test_accuracy.append(model.score(X_test, y_test))
accuracy_score.append(metrics.accuracy_score(y_test, y_pred))
fpr, tpr, _thresholds = metrics.roc_curve(y_test, y_pred)
AUC_score.append(metrics.auc(fpr, tpr))
cm_results.append(confusion_matrix(y_test, y_pred))
fig = plt.figure(figsize=(12, 10))
for i in range(len(cm_results)):
cm = cm_results[i]
model = models[i+2]
sub = fig.add_subplot(2, 2, i+1).set_title(model)
cm_plot = sns.heatmap(cm, annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
results1_df = pd.DataFrame({'Model': models, 'Train Time': train_time, 'Train Score': train_accuracy , 'Test Score': test_accuracy , 'Accuracy Score': accuracy_score, 'AUC Score': AUC_score})
results1_df
| | Model | Train Time | Train Score | Test Score | Accuracy Score | AUC Score |
|---|---|---|---|---|---|---|
| 0 | Baseline | 0.003001 | 0.502584 | 0.502064 | N/A | N/A |
| 1 | Logistic Regression | 0.186940 | 0.910236 | 0.910172 | 0.910172 | 0.69066 |
| 2 | KNN | 0.027991 | 0.925011 | 0.901594 | 0.901594 | 0.676082 |
| 3 | Decision Tree | 0.156949 | 1.000000 | 0.891236 | 0.891236 | 0.728408 |
| 4 | SVM | 12.101380 | 0.909889 | 0.907745 | 0.907745 | 0.64968 |
Now that we have some basic models on the board, we want to try to improve these. Below, we list a few things to explore in this pursuit.
For this task, we can use GridSearchCV with Logistic Regression to tune its hyperparameters, then use the fitted coefficients to gauge feature importance and score the result.
models = ['Logistic Regression', 'KNN', 'Decision Tree','SVM']
best_params = []
fit_time = []
acc_score = []
recall_score = []
prec_score = []
f1_score = []
r2_score = []
roc_auc_score = []
grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}  # l1 = Lasso, l2 = Ridge
lgr = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2 penalties
lgr_cv = GridSearchCV(lgr, grid, cv=10)
lgr_cv.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=LogisticRegression(solver='liblinear'),
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'penalty': ['l1', 'l2']})
lgr_cv.best_params_
{'C': 0.1, 'penalty': 'l2'}
lgr2=LogisticRegression(C=0.1, penalty="l2", random_state=42)
start_time = time()
lgr2.fit(X_train,y_train)
train_time = time() - start_time
y_pred = lgr2.predict(X_test)
cm_plot = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
plt.title("Performance Confusion Matrix")
Text(0.5, 1.0, 'Performance Confusion Matrix')
best_params.append(lgr_cv.best_params_)
fit_time.append(train_time)
acc_score.append(metrics.accuracy_score(y_test, y_pred))
recall_score.append(metrics.recall_score(y_test, y_pred))
prec_score.append(metrics.precision_score(y_test, y_pred))
f1_score.append(metrics.f1_score(y_test, y_pred))
r2_score.append(metrics.r2_score(y_test, y_pred))
roc_auc_score.append(metrics.roc_auc_score(y_test, y_pred))
importance = lgr2.coef_[0]
# Pair each coefficient with the column order of X_train (the matrix the
# model was actually fit on), not df, so names and scores line up correctly
for name, v in zip(X_train.columns, importance):
    print(f'Feature: {name} \t Score:{v}')
plt.bar(range(len(importance)), importance)
plt.show()
Feature: age 	 Score:0.019773651500845636
Feature: duration 	 Score:1.1480340217808676
Feature: pdays 	 Score:-0.2078558144586879
Feature: cons.price.idx 	 Score:0.5193231372629064
Feature: cons.conf.idx 	 Score:0.14228992049670514
Feature: euribor3m 	 Score:0.7529442279562094
Feature: nr.employed 	 Score:-0.6498014654605104
Feature: job 	 Score:0.004382178112697233
Feature: marital 	 Score:0.04993035947952877
Feature: education 	 Score:0.05186935489892263
Feature: default 	 Score:-0.32081588303873804
Feature: housing 	 Score:-0.027658846014259653
Feature: loan 	 Score:-0.007435320557297817
Feature: contact 	 Score:-0.7402021707006964
Feature: month 	 Score:-0.10179055837521613
Feature: day_of_week 	 Score:0.06256460286642854
Feature: campaign 	 Score:-0.04940064876712563
Feature: previous 	 Score:-0.08982603542990603
Feature: poutcome 	 Score:0.4107291786538255
Feature: emp.var.rate 	 Score:-0.9096948525250969
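Raw coefficients are hard to compare across features that were label-encoded and scaled differently. permutation_importance (imported at the top but unused so far) offers a model-agnostic alternative: shuffle one column at a time and measure the accuracy drop. A minimal sketch on synthetic data, illustrative rather than the notebook's features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

# Synthetic data: only feature 0 drives the label, 1-2 are noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# Importance of a feature = mean accuracy drop when that column is
# randomly permuted, averaged over n_repeats shuffles.
result = permutation_importance(clf, X_demo, y_demo,
                                n_repeats=10, random_state=42)
print(result.importances_mean)
```

On the notebook's data, permutation_importance(lgr2, X_test, y_test) would rank the actual features on held-out data.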
### Let's tune KNN with GridSearch to see if we can do better
param_grid = dict(n_neighbors=list(range(1, 22)))
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', verbose=1)
knn_cv.fit(X_train, y_train)
Fitting 10 folds for each of 21 candidates, totalling 210 fits
knn_cv.best_params_
knn2=KNeighborsClassifier(n_neighbors=18)
start_time = time()
knn2.fit(X_train,y_train)
train_time = time() - start_time
y_pred = knn2.predict(X_test)
cm_plot = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
plt.title("Performance Confusion Matrix")
best_params.append(knn_cv.best_params_)
fit_time.append(train_time)
acc_score.append(metrics.accuracy_score(y_test, y_pred))
recall_score.append(metrics.recall_score(y_test, y_pred))
prec_score.append(metrics.precision_score(y_test, y_pred))
f1_score.append(metrics.f1_score(y_test, y_pred))
r2_score.append(metrics.r2_score(y_test, y_pred))
roc_auc_score.append(metrics.roc_auc_score(y_test, y_pred))
params = {'max_depth': [1, 3, 5, 7, 11, 13, 15, 17, 19, 21, 23],
          'min_samples_split': [2, 50],
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [1]}
dtree = DecisionTreeClassifier()
dtree_cv = GridSearchCV(dtree, params, cv=10)
dtree_cv.fit(X_train, y_train)
dtree_cv.best_params_
dtree2=DecisionTreeClassifier(criterion='gini', max_depth=7, min_samples_leaf=1, min_samples_split=50, random_state=42)
start_time = time()
dtree2.fit(X_train,y_train)
train_time = time() - start_time
y_pred = dtree2.predict(X_test)
cm_plot = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
plt.title("Performance Confusion Matrix")
best_params.append(dtree_cv.best_params_)
fit_time.append(train_time)
acc_score.append(metrics.accuracy_score(y_test, y_pred))
recall_score.append(metrics.recall_score(y_test, y_pred))
prec_score.append(metrics.precision_score(y_test, y_pred))
f1_score.append(metrics.f1_score(y_test, y_pred))
r2_score.append(metrics.r2_score(y_test, y_pred))
roc_auc_score.append(metrics.roc_auc_score(y_test, y_pred))
dtree_coeffs = pd.Series(dtree2.feature_importances_, index=X_train.columns)
dtree_coeffs.plot(kind='bar')
plt.title('Decision Tree Feature Importances')
plt.xlabel('Feature')
Observations
params = {'kernel': ['rbf', 'poly', 'linear', 'sigmoid'],
'gamma': [0.1, 1.0, 10.0, 100.0]}
svc = SVC()
svc_cv = GridSearchCV(svc, params, cv=10)
#To save time and computing resource, run the grid search on a 1000-row sample;
#sample X and y from the same index so the rows stay aligned.
sample_idx = X_train.sample(1000, random_state=42).index
svc_cv.fit(X_train.loc[sample_idx], y_train.loc[sample_idx])
svc_cv.best_params_
svc2=SVC(gamma=0.1, kernel='rbf')
start_time = time()
svc2.fit(X_train,y_train)
train_time = time() - start_time
y_pred = svc2.predict(X_test)
cm_plot = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
plt.title("Performance Confusion Matrix")
best_params.append(svc_cv.best_params_)
fit_time.append(train_time)
acc_score.append(metrics.accuracy_score(y_test, y_pred))
recall_score.append(metrics.recall_score(y_test, y_pred))
prec_score.append(metrics.precision_score(y_test, y_pred))
f1_score.append(metrics.f1_score(y_test, y_pred))
r2_score.append(metrics.r2_score(y_test, y_pred))
roc_auc_score.append(metrics.roc_auc_score(y_test, y_pred))
def f_importances(coef, names):
    # Sort the features by coefficient value, then draw a horizontal bar chart
    imp, names = zip(*sorted(zip(coef, names)))
    rcParams['figure.figsize'] = 20, 20
    plt.barh(range(len(names)), imp, align='center')
    plt.yticks(range(len(names)), names)
    plt.show()
features_names = X_train.columns
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
f_importances(svm.coef_[0], features_names)
Observations
results2_df = pd.DataFrame({'Model': models, 'Best_Params': best_params, 'Train_Time': fit_time, 'Accuracy_Score': acc_score, 'Recall_Score': recall_score, 'Precision_Score': prec_score, 'F1_Score': f1_score, 'R2_Score': r2_score, 'ROC_AUC_Score': roc_auc_score})
results2_df